We have found that known community identification algorithms produce inconsistent communities 
when the node ordering changes at input. We propose two metrics to quantify the level of consistency 
across multiple runs of an algorithm: pairwise membership probability and consistency. Based on 
these two metrics, we address the consistency problem without compromising the modularity. Our 
solution uses pairwise membership probabilities as link weights and generates consistent communities 
within six or fewer cycles. It offers a new tool in the study of community structures and their 
evolutions. 

PACS numbers: 89.75.-k, 89.75.Hc 



Understanding and identifying community structure in 
a complex network has been one of the major research 
topics in sociology, physics, biology, and computer
science [1]. Various algorithms for discovering communities
and modules in networks have been proposed: some are
based on betweenness and similar measures by removing
inter-community links; others use cliques, information
theory, random walks on networks, or similarity among
partitions, and the list is not exhausted.

Among these algorithms, greedy modularity maximiza- 
tion is one of the prevalent approaches for community 
identification. The modularity, Q, is a quality measure 
of partitioned communities. It is defined as: 

$$ Q = \sum_i \left( e_{ii} - a_i^2 \right) \qquad (1) $$

where e_ii is the ratio of the number of links between nodes
belonging to community i over all links and a_i is the ratio
of link ends attached to nodes in community i over all link
ends. The value of modularity ranges from -1 to 1.
The value Q = 0 implies that the number of links within
a community is no better than random.
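
As a rough illustration of Eq. (1), the following Python sketch computes Q for an undirected, unweighted graph. It assumes the standard reading in which a_i is the fraction of link ends attached to community i; the function name `modularity` and the toy graph of two triangles joined by one link are hypothetical and serve only as an example.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Q = sum_i (e_ii - a_i^2) for an undirected, unweighted graph."""
    m = len(edges)
    internal = defaultdict(int)  # links with both ends inside community i
    ends = defaultdict(int)      # link ends attached to community i
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        ends[cu] += 1
        ends[cv] += 1
        if cu == cv:
            internal[cu] += 1
    # e_ii = internal links / all links; a_i = link ends / all link ends
    return sum(internal[c] / m - (ends[c] / (2 * m)) ** 2 for c in ends)


# Hypothetical toy graph: two triangles joined by a single bridge link.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
split = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
single = {n: "A" for n in range(6)}

print(modularity(edges, split))   # one community per triangle
print(modularity(edges, single))  # everything in one community
```

Placing every node in one community gives Q = 0 in this sketch, which matches the interpretation of Q = 0 as no better than random.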

Modularity maximization methods (MMMs) are effec- 
tive in identifying and uncovering community structure 
in networked systems, but they have some limitations. 
For example, MMMs fail to identify communities smaller 
than a certain scale, which is known as the resolution 
limit.

In this work we report another limitation of MMMs, 
namely, the inconsistency among identified communities 
in multiple runs of an algorithm. Using empirical net- 
work data, we show that all algorithms we have reviewed 
produce inconsistent communities every time the node 
names are reordered while the structure of the network 
remains unchanged. 

We consider three community identification algorithms:
Clauset-Newman-Moore (CNM) [9], Wakita [10],
and Louvain [11]. They all take a greedy approach to
modularity maximization and are the only known
algorithms to work for large networks. However, they all
produce different values of modularity for the same network.
Even a single algorithm produces different modularities
when the input order of nodes changes. We show an
example to illustrate the inconsistency even in a small,
well-studied network. The communities identified by the
Louvain algorithm under three different orderings of nodes
are shown in Fig. 1. Although the network has only
34 nodes, the communities identified in Fig. 1(a), (b),
and (c) are quite different and have different modularities.
This example demonstrates that even for a small network,
the input order plays a crucial role in determining
community structure in complex networks.

The huge number of ways to partition a graph makes 
it impossible to optimize modularity exhaustively. From 
a macroscopic view this is fine as long as the modularity
does not vary too much. However, if we are interested in
network analysis from a nodal perspective, that is, in
identifying the community a node belongs to, it does not
make sense for the node to belong to a completely different
community every time the input order is perturbed. For
example, suppose we have two snapshots of a growing network taken
a year apart. How has the community of a node grown in 
a year? This question is about evolutionary clustering, 
and inconsistent communities are a problem. What we 
address in this work is the inconsistency not even over 
the course of evolution, but within a single snapshot. If 
the community identification algorithm is so sensitive to 
the order of the input and produces completely differ- 
ent communities from a node's perspective, we cannot 
answer the question raised in the example. Thus before 
we identify the community a node belongs to, we should 
ask: how consistent is the community membership across 
different input orders? 

Over N runs of an algorithm, each with a randomly
ordered input set, we quantify the likelihood of a pair of
nodes resulting in the same community as:

$$ p_{ij} = \frac{\sum_{n=1}^{N} \delta_n(C_i, C_j)}{N} \qquad (2) $$

where

$$ \delta_n(C_i, C_j) = \begin{cases} 1, & \text{if } C_i = C_j \text{ in the } n\text{th dataset} \\ 0, & \text{otherwise} \end{cases} $$


and i and j are node indices and C_i and C_j represent the
communities that i and j belong to, respectively. We call
this metric pairwise membership probability. The pairwise
membership probability p_ij represents the empirical
probability that two nodes belong to the same community
across multiple runs of the same algorithm. We can
compute p_ij for all possible pairs of nodes. However, for
any specific i, p_ij is likely to be 0 for most j due to the
sparsity of links in the network, and this tendency grows
with the network size. Therefore, we consider p_ij only for
adjacent nodes, that is, only between neighboring
nodes.
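
The pairwise membership probability of Eq. (2), restricted to neighboring nodes, can be sketched in a few lines of Python. The function name `pairwise_membership` and the four example partitions are hypothetical; each partition is assumed to be a dictionary mapping a node to the label of its community in one run, and labels need not agree across runs.

```python
def pairwise_membership(edges, partitions):
    """p_ij for every link (i, j): fraction of runs placing i and j together."""
    n_runs = len(partitions)
    return {
        (i, j): sum(part[i] == part[j] for part in partitions) / n_runs
        for i, j in edges
    }


# Hypothetical example: a 4-node path and the outcome of N = 4 runs.
edges = [(0, 1), (1, 2), (2, 3)]
partitions = [
    {0: 0, 1: 0, 2: 1, 3: 1},
    {0: 0, 1: 0, 2: 0, 3: 1},
    {0: 0, 1: 0, 2: 1, 3: 1},
    {0: 5, 1: 5, 2: 7, 3: 7},  # labels differ between runs; only equality matters
]

p = pairwise_membership(edges, partitions)
print(p)
```

Only equality of labels within a single run is compared, so the sketch does not require community labels to be aligned between runs.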

A pairwise membership probability of 1 means that
the two neighboring nodes always belong to the same
community, and 0 means that the two never belong to
the same community, irrespective of the input order. The
larger the number of pairs whose empirical pairwise
membership probability is close to either 0 or 1, the more
consistent the identified communities are. In contrast, p_ij
close to 1/2 means that i and j end up in the same community
more or less randomly.

In order to quantify network-wide community mem- 
bership consistency, we define a metric of consistency C 
for the entire network as: 


$$ C = \frac{\sum_{(i,j) \in E} \left( p_{ij} - 1/2 \right)^2}{|E|} \cdot \frac{1}{(0.5)^2} \qquad (3) $$

where E is the set of links and |E| is the number of links.
The consistency C weighs the pairwise membership
probabilities by their distance from 1/2. The multiplicative
term 1/(0.5)^2 normalizes C to the range from 0 to 1.
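
A minimal Python sketch of the consistency metric of Eq. (3) follows. The function name `consistency` and the sample probability dictionaries are hypothetical; the input is assumed to map each link (i, j) to its pairwise membership probability.

```python
def consistency(p):
    """C = mean over links of (p_ij - 1/2)^2, divided by (0.5)^2."""
    total = sum((p_ij - 0.5) ** 2 for p_ij in p.values())
    return total / len(p) / 0.5 ** 2


# Hypothetical pairwise membership probabilities on three links.
always_or_never = {(0, 1): 1.0, (1, 2): 0.0, (2, 3): 1.0}
coin_flip = {(0, 1): 0.5, (1, 2): 0.5, (2, 3): 0.5}
mixed = {(0, 1): 1.0, (1, 2): 0.25, (2, 3): 0.75}

print(consistency(always_or_never))
print(consistency(coin_flip))
print(consistency(mixed))
```

Since each p_ij lies between 0 and 1, every squared term is at most 1/4, so C reaches 1 only when all probabilities are 0 or 1 and drops to 0 when all equal 1/2.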
FIG. 2: [Color Online] Consistency of community identification



We have analyzed consistency in community memberships
of seven empirical systems from various fields:
the karate club [12], the dolphin social network [13], the
co-appearance network of characters in the novel Les
Miserables [14], the adjacency network of common
adjectives and nouns in the novel David Copperfield [15],
the regular season network of American football games
between Division IA colleges during Fall 2000, a
directed network of hyperlinks between weblogs on US
politics [16], and the network of coauthorships between
scientists posting preprints on the Condensed Matter
E-Print Archive [17]. Table I shows basic statistics of the
seven networks.

TABLE I: Summary of the statistics of the network structure for the seven empirical networks. N is the number of nodes, L is
the number of links, <k> is the average degree, and C is the global clustering coefficient.

       Karate   Dolphin   Les Miserables   Adjectives   Football   Political blog   Condensed matter
N      34       62        77               112          115        1222             36458
L      78       159       254              425          613        16714            171736
<k>    4.6      5.1       6.6              7.6          10.7       27.4             9.4
C      0.57     0.26      0.57             0.17         0.40       0.32             0.66

In the case of communities detected by the CNM algorithm
in the karate club, 12.8% of the pairwise membership
probabilities are 0 and the rest of the pairs have 1, which
means that nodes of a community always belong to the
same community over N runs: C = 1. In Fig. 2 we show
the consistency from the three algorithms. There is no 
one algorithm that outperforms the other two in all net- 
works and no consistent correlation between the consis- 
tency and the topological characteristics of the network, 
such as network size, average degree and average cluster- 
ing coefficient. However, a closer look at pairwise mem- 
bership probabilities reveals that in all networks far more 
than 50% of pairs have pairwise membership probabilities 
either smaller than 0.2 or greater than 0.8. This means
that most pairs of nodes are never in the same community 
or always in the same community, respectively. Based 
on this observation, we devise a consistency reinforcing 
mechanism as follows. After each cycle of N runs, we 
calculate the pairwise membership probabilities and then 
assign them as link weights. From the second cycle on, 
we use this weighted network as an input and continue 
the cycle until C reaches 0.999 or higher. In a weighted 
network, an edge of a higher weight is placed within a 
community, while an edge of a lower weight bridges com- 
munities. Since we assign the pairwise membership prob- 
ability as the weight of the corresponding link, an edge of 
high pairwise membership probability in the prior cycle 
is more likely to be placed within a community in the 
next cycle. Therefore, links with higher weights are rein- 
forced through multiple cycles and eventually consistent 
communities emerge. 
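
The reinforcing cycle described above can be sketched as a small Python loop. This is only an outline under stated assumptions: `reinforce` and its parameters are hypothetical names, `detect` stands for any weighted community detection routine (for instance a wrapper around a Louvain implementation) that takes link weights and a random generator and returns a node-to-community mapping, and `toy_detect` is a deliberately simple order-sensitive stand-in, not CNM, Wakita, or Louvain.

```python
import random


def consistency(p):
    """Eq. (3): mean squared distance of p_ij from 1/2, scaled to [0, 1]."""
    return sum((x - 0.5) ** 2 for x in p.values()) / len(p) / 0.25


def reinforce(edges, detect, n_runs=100, threshold=0.999, max_cycles=6, seed=0):
    """Repeat cycles of n_runs detections, feeding p_ij back as link weights."""
    rng = random.Random(seed)
    weights = {e: 1.0 for e in edges}  # first cycle: unweighted network
    history = []                       # consistency C after each cycle
    for _ in range(max_cycles):
        together = {e: 0 for e in edges}
        for _ in range(n_runs):
            part = detect(weights, rng)  # node -> community label
            for i, j in edges:
                together[(i, j)] += part[i] == part[j]
        weights = {e: together[e] / n_runs for e in edges}  # p_ij as new weights
        history.append(consistency(weights))
        if history[-1] >= threshold:
            break
    return weights, history


def toy_detect(weights, rng):
    """Hypothetical stand-in detector: cut one lightest link (ties broken at
    random), then label each node by its connected component."""
    lightest = min(weights.values())
    cut = rng.choice([e for e, w in weights.items() if w == lightest])
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in weights:
        find(u)
        find(v)
        if (u, v) != cut:
            parent[find(u)] = find(v)
    return {n: find(n) for n in list(parent)}


# Hypothetical usage on a 4-node path, where every link is initially tied.
path = [(0, 1), (1, 2), (2, 3)]
weights, history = reinforce(path, toy_detect, n_runs=50, seed=1)
print(history)
```

In the stand-in detector, ties among equally light links play the role of the input-order sensitivity. Once the weights differ, the same link is cut in every run, which mirrors how reinforced weights reduce ties.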

Our approach has the effect of removing those links
with pairwise membership probabilities of 0 in the next
cycle and spreading unit link weight between 0 and
1, thus significantly reducing ties in calculating ΔQ.
When there are ties, can we give preference to nodes
based on other metrics, such as degrees or betweenness
centrality? To assess the benefit of other metrics, if
any, we order nodes by the degree, clustering coefficient,
degree correlation, and betweenness centrality and
compute modularity. Even if we employ all the metrics in
tie breaking, we cannot eliminate ties completely [18]. In
other words, no single topological characteristic
consistently stands out to work better than others in all
networks. We have looked at edge betweenness as well, and
found no correlation between edge betweenness and pair- 
wise membership probability. 

Our approach of reinforcing consistency in multiple
cycles is applicable to any of the three algorithms. We
include only the results from the Louvain algorithm in this
paper, for it is the fastest and the only one that scales up to
billions of links. We report that the other two algorithms
have similar results.

FIG. 3: Convergence of consistency

The convergence of consistency after 5 cycles is shown
in Fig. 3. The consistency of all networks reaches 1 in 5 cycles.
In Fig. 4 we show how the modularity converges over
5 cycles. The modularity converges almost to a single
point after 2 cycles. Furthermore, the modularity after
convergence is higher. Figure 4 demonstrates that our
approach has no negative impact on modularity, and even
improves it in certain networks.

So far we have shown that our solution of using pair- 
wise membership probabilities as link weights has im- 
proved consistency greatly. Now we check if communi- 
ties from different trials come out identically. We turn 
our focus to individual communities in two independent 
trials. A cycle is N runs for a given network. A trial 
is M cycles of a given ordering of the network. We use 
M = 6 and N = 100. In order to check if the communi- 
ties are identical across trials, we calculate the maximum 
Jaccard coefficient (the ratio of intersection to union of 
two communities) of a community against all communi- 
ties of another trial. The Jaccard coefficient of 1 means 
that the same communities are produced in both trials. 
We compare the Jaccard coefficients for all pairs of
trials, and most Jaccard coefficients are found to be greater
than 0.95.

FIG. 4: [Color Online] Convergence of modularity ('Un' indicates
modularity of unweighted network)
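
The comparison of communities across two trials can be illustrated with the following Python sketch. The function names `jaccard` and `best_matches` and the two example trials are hypothetical; each trial is assumed to be a list of non-empty node sets, one per community.

```python
def jaccard(a, b):
    """Ratio of intersection size to union size of two node sets."""
    return len(a & b) / len(a | b)


def best_matches(trial_a, trial_b):
    """For each community of trial_a, its maximum Jaccard coefficient
    against all communities of trial_b."""
    return [max(jaccard(c, d) for d in trial_b) for c in trial_a]


# Hypothetical communities from two independent trials.
trial_a = [{0, 1, 2}, {3, 4, 5}]
trial_b = [{3, 4, 5}, {0, 1}, {2}]

scores = best_matches(trial_a, trial_b)
print(scores)
```

A score of 1 for every community in both directions would indicate that the two trials produced identical communities.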

In summary, we have investigated the inconsistency 
among communities by existing community identification 
algorithms: CNM [9], Wakita [10], and Louvain [11]. 
Using empirical network data, we have shown that 
all three algorithms produce inconsistent communities 
every time the node ordering changes, even if the
networks are small. Similar results based on very large
online social networks are also reported [18]. To quantify
the consistency of identified communities, we introduced
pairwise membership probability and consistency. The
former quantifies the likelihood of two nodes resulting
in the same community, and the latter represents the
global level of consistency of a network, derived from
pairwise membership probabilities. We analyze seven 
empirical networks in terms of the above two metrics 
and show that no one algorithm outperforms the other 
two in all networks. However, most pairwise membership
probabilities are close to either 0 or 1 (that is,
never in the same community or always in the same 
community, respectively). Based on this observation, we 
have proposed a solution that improves the consistency 
without compromising the modularity. The key idea is 
to set the pairwise membership probability as the link 
weight and find communities in the weighted network 
iteratively. We have demonstrated the convergence of 
consistency within 6 or fewer cycles. Resulting commu- 
nities exhibit consistent grouping through multiple trials. 